My GitHub repository for my assignments can be found at this URL: https://github.com/chimandy/compscix-415-2-assignments.git
library(mdsr)
library(tidyverse)
library(ggplot2)
ggplot(data = mpg)
If I run ggplot(data = mpg), I see an empty graph, because no geom layer has been added yet.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
There are 234 rows and 11 columns in mpg.
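As a quick check of the dimensions read off the printed tibble, here is a minimal sketch. It assumes the ggplot2 package (which provides mpg) is installed; the names n_rows and n_cols are only illustrative.

```r
library(ggplot2)  # provides the mpg dataset

n_rows <- nrow(mpg)  # number of observations (cars)
n_cols <- ncol(mpg)  # number of variables
dim(mpg)             # both at once: rows first, then columns
```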
?mpg
By running ?mpg, I found that the drv variable describes the type of drive train. It takes 3 values: f = front-wheel drive, r = rear wheel drive, 4 = 4wd.
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
A scatterplot of class vs drv is not useful because both are categorical variables: class lists the types of car and drv the drive train categories. With no continuous variable on either axis, every car with the same class and drv is drawn at the same position, so the points overlap one another and the plot only shows which combinations occur, not how many cars fall into each or any trend between the two.
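One possible way to make the overlap visible is to count the cars behind each point. The sketch below is an alternative I am suggesting, not part of the exercise; it assumes dplyr is installed, and the names combos and p_count are illustrative.

```r
library(ggplot2)
library(dplyr)

# Count how many cars share each (class, drv) combination.
combos <- count(mpg, class, drv)
combos

# geom_count() sizes each point by the number of observations at that
# position, so heavily overplotted combinations stand out.
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
```

Printing p_count draws the plot. The table in combos has far fewer rows than mpg, which is why the plain scatterplot shows so few points.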
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
The above code does not turn the points blue because the color argument is inside aes() instead of outside aes() but still within geom_point(). Inside aes(), "blue" is treated as a variable with a single value and is assigned a color by the default scale, not used as a literal color. Therefore, the correct code is as below:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
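To see the difference between mapping and setting the color without looking at the plots, the drawn values can be inspected with layer_data(). This is a minimal sketch; the object names are illustrative.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as data (a one-level variable),
# so the color comes from the default discrete color scale.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): "blue" is a fixed setting applied to every point.
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the values ggplot2 actually uses to draw a layer.
mapped_colours <- unique(layer_data(p_mapped, 1)$colour)
fixed_colours  <- unique(layer_data(p_fixed, 1)$colour)
```

Here mapped_colours holds a single scale-assigned color that is not "blue", while fixed_colours is exactly "blue".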
In the mpg dataset:
a. Categorical variables: manufacturer, model, trans, drv, fl, class.
b. Continuous variables: displ, year, cyl, cty, hwy.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
When you run mpg, categorical variables are of type <chr>.
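The same split can be read off the column types programmatically. This sketch assumes that column type is a good enough proxy: integer columns such as year and cyl take only a few values and could arguably be treated as categorical.

```r
library(ggplot2)

# Class of every column in mpg.
col_types <- sapply(mpg, class)
col_types

# Character columns are treated as categorical; numeric columns
# (integer or double) are treated as continuous.
categorical <- names(mpg)[sapply(mpg, is.character)]
continuous  <- names(mpg)[sapply(mpg, is.numeric)]
```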
I. Continuous variable:
a. color:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty))
b. size:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cty))
c. shape:
#ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
## mapping: x = ~displ, y = ~hwy, shape = ~cty
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
Error: with the ggplot() line uncommented, the plot fails with the error “A continuous variable can not be mapped to shape”.
II. Categorical variable:
a. color:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
b. size:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
c. shape:
ggplot(data=mpg)+
geom_point( mapping = aes(x=displ, y=hwy,shape=class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
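Warning: mapping class to size gives the warning “Using size for a discrete variable is not advised”. Mapping class to shape gives a warning that the shape palette handles at most 6 discrete values, so with 7 classes, 62 rows are removed from the plot.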
Mapping a continuous variable and a categorical variable to color, size, and shape shows some major differences between these aesthetics. For color, a continuous variable like cty is shown on a color gradient (e.g. shades of blue) that encodes city miles per gallon, whereas a categorical variable like class gets a distinct color for each type of car. For size, a continuous variable gives dots whose size grows with the value, from smallest to largest. A categorical variable also gets a different size for each category, but ggplot2 warns that this is not advised. For shape, a continuous variable produces an error, whereas a categorical variable gets a different shape for each category.
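The shape error for a continuous variable can be captured without stopping the script. A minimal sketch, assuming the error is raised when the plot is built and not when it is defined (the names p_shape and result are illustrative):

```r
library(ggplot2)

# Defining the plot succeeds; nothing is checked yet.
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))

# Building the plot triggers the error, so wrap that step in tryCatch()
# and keep the condition object.
result <- tryCatch(ggplot_build(p_shape), error = function(e) e)

inherits(result, "error")   # did building fail?
conditionMessage(result)    # the error text (wording varies by ggplot2 version)
```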
ggplot(data=mpg)+
geom_point( mapping = aes(x=displ, y=hwy, color=cty, size=cty))
If I map the same variable cty to multiple aesthetics, such as color and size, both aesthetics are applied and multiple legends are generated.
?geom_point
ggplot(data=mpg)+
geom_point(mapping=aes(x=displ, y=hwy), stroke=3, alpha=0.4, color='blue')
stroke adjusts the thickness of the border for shapes that have one, such as shapes 21 to 25, which can take on different colors inside (fill) and outside (color).
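The effect of stroke is easiest to see on a filled shape. The sketch below uses shape 21 with fill, color, and size values that I picked for illustration only.

```r
library(ggplot2)

# shape = 21 is a filled circle: `fill` colors the interior,
# `color` colors the border, and `stroke` sets the border width.
p_stroke <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    shape = 21, fill = "white", color = "blue", stroke = 3, size = 3
  )

# Inspect the aesthetic values used to draw the points.
drawn <- layer_data(p_stroke, 1)
unique(drawn[, c("shape", "fill", "colour", "stroke")])
```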
ggplot(data = mpg)+
geom_point(mapping = aes(x=displ, y=hwy, color=displ<5))
If you map an aesthetic to something other than a variable name, like aes(colour = displ < 5), R evaluates the expression and creates a temporary variable containing the results. Here, the new variable is TRUE if the engine displacement is less than 5 and FALSE if it is greater than or equal to 5.
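The temporary variable can be reproduced outside the plot to see what is being mapped. A minimal sketch with illustrative names:

```r
library(ggplot2)

# The expression is evaluated row by row and yields a logical vector.
small_engine <- mpg$displ < 5
is.logical(small_engine)
table(small_engine)

# Mapping the expression to color gives one color per logical value.
p_logical <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
n_colours <- length(unique(layer_data(p_logical, 1)$colour))
```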
What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Using faceting instead of the color aesthetic has the following advantages and disadvantages:
a. Advantages:
Faceting splits the data into separate panels, one for each class, so we can better visualize and study the trend within each individual facet.
b. Disadvantages:
It’s harder to visualize the overall relationship across all facets.
The color aesthetic is fine when the dataset is small, but as the dataset grows, the points begin to overlap with one another, and a single colored plot can become confusing and insufficient for the audience, so the balance shifts toward faceting.
?facet_wrap
nrow: Defines the number of rows the faceted plot should have.
ncol: Defines the number of columns the faceted plot should have.
scales: Should scales be fixed (“fixed”, the default), free (“free”), or free in one dimension (“free_x”, “free_y”)?
shrink: If TRUE, will shrink scales to fit output of statistics, not raw data. If FALSE, will be range of raw data before statistical summary.
labeller: A function that takes one data frame of labels and returns a list or data frame of character vectors. Each input column corresponds to one factor. Thus there will be more than one with formulae of the type ~cyl + am. Each output column gets displayed as one separate line in the strip label. This function should inherit from the “labeller” S3 class for compatibility with labeller(). See label_value() for more details and pointers to other options.
as.table: If TRUE, the default, the facets are laid out like a table with highest values at the bottom-right. If FALSE, the facets are laid out like a plot with the highest value at the top-right.
switch: By default, the labels are displayed on the top and right of the plot. If “x”, the top labels will be displayed to the bottom. If “y”, the right-hand side labels will be displayed to the left. Can also be set to “both”.
drop: If TRUE, the default, all factor levels not used in the data will automatically be dropped. If FALSE, all factor levels will be shown, regardless of whether or not they appear in the data.
dir: Direction: either “h” for horizontal, the default, or “v”, for vertical.
strip.position: By default, the labels are displayed on the top of the plot. Using strip.position it is possible to place the labels on any of the four sides by setting strip.position to one of “top”, “bottom”, “left”, or “right”.
facet_grid() doesn’t have nrow and ncol arguments like facet_wrap() because facet_grid() forms a matrix of panels defined by row and column faceting variables, so the number of rows and columns is determined by the number of distinct values of those variables. It is most useful when you have two discrete variables and all combinations of the variables exist in the data. facet_wrap(), in contrast, wraps a 1d sequence of panels into 2d, which is generally a better use of screen space than facet_grid() because most displays are roughly rectangular.
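A small sketch of the difference, using drv and cyl for the grid and class for the wrap. The choice of variables and of ncol = 4 is mine, for illustration.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# facet_grid(): rows come from drv and columns from cyl, so the layout
# is fixed by the number of distinct values of each variable.
p_grid <- base_plot + facet_grid(drv ~ cyl)
grid_rows <- length(unique(mpg$drv))
grid_cols <- length(unique(mpg$cyl))

# facet_wrap(): a single sequence of panels (one per class) that is
# wrapped into the requested number of columns.
p_wrap <- base_plot + facet_wrap(~ class, ncol = 4)
n_panels <- length(unique(mpg$class))
```

Printing p_grid and p_wrap draws the two layouts. Only the wrapped layout has a free choice of rows and columns, which is why only facet_wrap() takes nrow and ncol.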
a. Line chart - geom_line()
b. Boxplot - geom_boxplot()
c. Histogram - geom_histogram()
d. Area chart - geom_area()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Why do you think I used it earlier in the chapter?
Without show.legend = FALSE
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
With show.legend = FALSE
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
show.legend = FALSE removes the legend, although the aesthetics are still mapped and plotted. When show.legend = FALSE is removed, the plot shows a legend that explains which colors correspond to which values. It was used earlier in the chapter because it keeps the plot clean and consistent with the other two plots it is compared against.
The se argument to geom_smooth() is a TRUE/FALSE value that determines whether or not to draw a confidence interval around the smoothing line.
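A minimal sketch of the se argument of geom_smooth(), comparing the default band with a line-only smooth. The object names are illustrative, and the check relies on layer_data() exposing the computed band limits.

```r
library(ggplot2)

# se = TRUE (the default): smoothing line plus confidence band.
p_band <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(se = TRUE)

# se = FALSE: smoothing line only.
p_line <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(se = FALSE)

# With se = TRUE the computed layer data carries the band limits.
band <- layer_data(p_band, 1)
has_band <- all(c("ymin", "ymax") %in% names(band))
```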
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
No, the two graphs will not look different, because they use the same data and mappings. The only difference is that the first one avoids duplication by passing the data and mappings to the ggplot() function, which applies them globally to each geom in the graph.
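One way to check this without comparing the plots by eye is to compare the computed data of each layer. This is a sketch under the assumption that equal layer data implies identical drawings; the object names are illustrative.

```r
library(ggplot2)

# Data and mappings given once, globally.
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mappings repeated in every layer.
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare what each layer would draw (layer 1 = points, layer 2 = smooth).
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p_global, 2), layer_data(p_local, 2)))
```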
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size=4) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size=4)+
geom_smooth(aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=mpg,mapping=aes(x=displ, y=hwy, color=drv))+
geom_point(size=4) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=mpg,mapping=aes(x=displ, y=hwy))+
geom_point(mapping= aes (color=drv), size=4) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=mpg, mapping=aes(x=displ, y=hwy))+
geom_point(aes(color=drv),size=4)+
geom_smooth(aes(linetype=drv), se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=mpg, mapping=aes(x=displ,y=hwy))+
geom_point(color="white", size=8)+
geom_point(aes(color=drv),size=4)
?geom_col
geom_col is one of two types of bar charts. It makes the heights of the bars represent values in the data. Meanwhile, geom_bar is the other type of bar chart; it makes the height of each bar proportional to the number of cases in each group (or, if the weight aesthetic is supplied, the sum of the weights).
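The relationship between the two can be sketched by drawing the same chart both ways: geom_bar() on the raw data, and geom_col() on counts computed beforehand. This assumes dplyr is installed; the object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows in each class itself.
p_bar <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# geom_col() expects the bar heights to be in the data already.
class_counts <- count(mpg, class)
p_col <- ggplot(data = class_counts) +
  geom_col(mapping = aes(x = class, y = n))

# Both layers end up with the same bar heights.
bar_heights <- as.numeric(layer_data(p_bar, 1)$y)
col_heights <- as.numeric(layer_data(p_col, 1)$y)
```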
The designer’s choices have some features that work and some that don’t. The bar charts presenting percentages work, as does the spectrum of blue used to differentiate between values within a very neat graphic design. What does not work is the lack of other colors, and some studies/numbers are not presented in any plot or bar chart, which makes them less convincing to the audience. I would add bar charts or plots for those numbers and use other colors to make it more presentable.